A Preprocessing and Analyzing Method of Images in PDF Documents for Mathematical Expression Retrieval

نویسندگان

  • Xuedong Tian
  • Botao Yu
  • Jing Sun
چکیده

PDF documents are the important information resources for a mathematical expression retrieval system. As a major component of PDF documents, the image objects must be converted to coded form with the help of character recognition and document analysis technology firstly for content based searching. Therefore, the quality of these images becomes the key factor which decides the correctness in this conversion process. Considering the characteristics of PDF images and mathematical expressions, a preprocessing and analyzing method was proposed which includes the modules of PDF image extraction, graying, binarization, denoising, skew correction and layout parameter detection. The features of mathematical expressions were adequately considered to avoid the information loss in image converting process and the adverse interference both to the analysis and correction process resulted from formulas. The experimental results show that the method is effective in improving the accuracy and efficiency of document image recognition, analysis and retrieval.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Document Analysis And Classification Based On Passing Window

In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

متن کامل

Un método eficaz de indexación para la recuperación de imágenes en archivos en formato pdf

One of the areas which is presently awakening more interest among researchers and users of Information Retrieval systems is the retrieval of documents containing images which are relevant to a need for information. In this case, the main objective is not the retrieval of the documents relevant to the user’s need for information, but the achievement of the images relevant to that need for inform...

متن کامل

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: the current research aimed to compare the effectiveness of various tags and codes for retrieving images from the Google. Design/methodology: selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...

متن کامل

Extraction, layout analysis and classification of diagrams in PDF documents

Diagrams are a critical part of virtually all scientific and technical documents. Analyzing diagrams will be important for building comprehensive document retrieval systems. This paper focuses on the extraction and classification of diagrams from PDF documents. We study diagrams available in vector (not raster) format in online research papers. PDF files are parsed and their vector graphics com...

متن کامل

Content Based Radiographic Images Indexing and Retrieval Using Pattern Orientation Histogram

Introduction: Content Based Image Retrieval (CBIR) is a method of image searching and retrieval in a  database. In medical applications, CBIR is a tool used by physicians to compare the previous and current  medical images associated with patients pathological conditions. As the volume of pictorial information  stored in medical image databases is in progress, efficient image indexing and retri...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014